When you’re in a city full of high-rises and densely packed apartments, it’s easy to forget how important street trees are. They provide many benefits to the city and its residents: they reduce air pollution, cool sidewalks with shade, reduce toxic runoff, and bring extra wildlife into our neighborhoods. Thankfully, official organizations like NYC Parks & Recreation track and care for trees throughout our city.
In this project, I will be using two different datasets: one from NYC Open Data and one from The Department of City Planning. The former will be used to analyze all sorts of information about trees planted throughout NYC, and the latter to construct maps.
Data Acquisition and Preparation
Packages
As usual, multiple packages will be used for this project. A few extra packages appear here as well, such as sf, which allows spatial data to be read and spatial operations to be performed.
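The setup chunk itself is not shown above. A minimal sketch of what it could look like, inferred from the functions called later in this project (the exact package list is an assumption):

```r
# Hypothetical setup chunk: packages inferred from functions used below
library(tidyverse)   # dplyr, ggplot2, stringr, etc.
library(sf)          # st_read, st_join, st_transform, st_drop_geometry
library(httr2)       # request, req_url_query, req_perform (paginated API calls)
library(glue)        # glue() string interpolation for file names and captions
library(scales)      # percent(), comma(), rescale()
library(kableExtra)  # kbl(), kable_minimal(), column_spec()
library(leaflet)     # interactive maps
```

Each library call will fail loudly if the package is missing, so any absent package can be installed with `install.packages()` before rendering.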
Let’s download the City Council district boundaries from The Department of City Planning website. This dataset contains the boundaries of every single district in NYC, which is crucial for later work: the majority of the exploratory analysis will be conducted through geospatial data analysis, essentially making maps.
As mentioned before, we will be using the sf package, with its st_read function, to read the data correctly. To avoid repeatedly downloading the data, I wrote a function that checks whether the data has already been downloaded, and only downloads it if it hasn’t. Since this file is hosted as a static file, we can quickly download it and create a local directory to store it.
Show the code
```r
#| echo: true
#| include: true
#| output: false
download_nycc_boundaries <- function(
    url = "https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/city-council/nycc_25c.zip") {
  # Set destination folder and file paths
  dest_dir <- file.path("data", "mp03")
  zip_file <- file.path(dest_dir, basename(url))

  # Create directory if it doesn't exist
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }

  # Download zip file if it doesn't exist
  if (!file.exists(zip_file)) {
    download.file(url, destfile = zip_file, mode = "wb")
  }

  # List contents of the zip to find the .shp file
  zip_contents <- unzip(zip_file, list = TRUE)
  shp_file_rel <- zip_contents$Name[grepl("\\.shp$", zip_contents$Name, ignore.case = TRUE)]
  if (length(shp_file_rel) == 0L) {
    stop("No .shp file found in the zip archive.")
  }

  # Use the first .shp file found
  shp_file_rel <- shp_file_rel[1]
  shp_file <- file.path(dest_dir, shp_file_rel)

  # Unzip only if shapefile doesn't exist
  if (!file.exists(shp_file)) {
    unzip(zip_file, exdir = dest_dir)
  }

  # Read and transform shapefile
  sf_data <- st_read(shp_file, quiet = TRUE)
  sf_data_wgs84 <- st_transform(sf_data, crs = "WGS84")
  return(sf_data_wgs84)
}

nyc_boundaries <- download_nycc_boundaries()
```
NYC Open Data
The next dataset is Forestry Tree Points, accessed through the API provided on the NYC Open Data site. However, downloading this dataset the same way as the last one returns only the first 1,000 rows, which is a problem because we need all of the data for an accurate analysis. To fix this, I raised the limit from the default 1,000 rows to 100,000 rows per request. I then found out that the dataset is much larger than that, so I made a plan for another function. Its main purpose is to repeatedly download the data in batches of 100,000 rows, stopping once a batch comes back with fewer rows than the batch size. Finally, we bind all of the rows together to create one complete dataset.
With 11 files created in this download, there are likely more than one million trees in this city! One thing to note: this dataset does not include trees in parks, so the true number of trees in NYC is likely even higher. The site also provides a data dictionary for us to reference if we are unsure what a variable means.
Show the code
```r
# Function to download NYC tree points data in chunks and combine into one sf object
download_nyc_tree_points <- function(
    base_url = "https://data.cityofnewyork.us/resource/hn5i-inap.geojson",
    limit = 100000  # Number of rows to fetch per request
) {
  # Directory where downloaded files will be saved
  dest_dir <- file.path("data", "mp03")

  # Initialize offset for pagination (start at 0)
  offset <- 0

  # List to store all downloaded sf objects
  all_files <- list()

  # Repeat until no more data is returned
  repeat {
    # Convert offset to string (avoid scientific notation)
    offset_str <- format(offset, scientific = FALSE)

    # Create file name for this chunk based on offset
    file_name <- glue("tree_points_{offset_str}.geojson")
    file_path <- file.path(dest_dir, file_name)

    # If file does not exist locally, download it
    if (!file.exists(file_path)) {
      # Build the request using httr2, adding query params for pagination
      req <- request(base_url) |>
        req_url_query(`$limit` = limit, `$offset` = offset_str) |>
        req_user_agent("Baruch College Mini-Project for Student, API access") |>
        req_perform()

      # Save raw response body to file
      writeBin(resp_body_raw(req), file_path)
    }

    # Try reading the GeoJSON file into an sf object
    sf_data <- tryCatch({
      st_read(file_path, quiet = TRUE)  # Read spatial data quietly
    }, error = function(e) {
      # If reading fails, print message and return NULL
      message(glue("Failed to read file at offset {offset_str}: {e$message}"))
      return(NULL)
    })

    # If no data returned or read failed, break the loop
    if (is.null(sf_data) || nrow(sf_data) == 0) {
      break
    }

    # Normalize planteddate column to character (avoid issues with mixed types)
    if ("planteddate" %in% names(sf_data)) {
      sf_data$planteddate <- as.character(sf_data$planteddate)
    }

    # Append this chunk to the list
    all_files[[length(all_files) + 1]] <- sf_data

    # If fewer rows than limit were returned, stop
    if (nrow(sf_data) < limit) {
      break
    }

    # Otherwise, increase offset and continue
    offset <- offset + limit
  }

  # Combine all chunks into one data frame and return the combined sf object
  combined_data <- bind_rows(all_files)
  return(combined_data)
}

tree_points <- download_nyc_tree_points()
```
Exploratory Analysis
Mapping NYC Trees
As said before, there are a lot of trees in NYC, likely more than a million. A simple map plotting every tree as a point would not work well: the dots would overlap and merge into one big green blob, and changing the alpha or point size wouldn’t help much. So for this specific map I used leaflet instead of ggplot. Here, you can zoom in and click a tree to see its species and overall health.
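The leaflet chunk itself is not reproduced above. A minimal sketch of how such a map could be built, using a tiny hypothetical data frame in place of the real points (the column names genusspecies and tpcondition match the dataset, but the coordinates, species values, and the stand-in object are assumptions for illustration):

```r
library(leaflet)

# Toy stand-in for a sample of the tree points data (hypothetical values)
trees <- data.frame(
  lng = c(-73.9840, -73.9830),
  lat = c(40.7400, 40.7405),
  genusspecies = c("Honeylocust", "Maidenhair Tree"),
  tpcondition  = c("Good", "Fair")
)

tree_map <- leaflet(trees) |>
  addProviderTiles(providers$CartoDB.Positron) |>
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    radius = 3, color = "darkgreen", stroke = FALSE, fillOpacity = 0.7,
    clusterOptions = markerClusterOptions(),  # cluster nearby points until zoomed in
    popup = ~paste0("Species: ", genusspecies, "<br>Condition: ", tpcondition)
  )

tree_map  # render the interactive widget
```

Clustering is what keeps a million points from becoming a solid green blob: markers collapse into counts at low zoom and separate as you zoom in, and each popup shows the species and condition on click.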
In order to conduct further analysis on NYC trees, I joined the Tree Points dataset onto the District Boundaries using st_join with st_contains, which performs a spatial join. However, computations slow down whenever a code block summarizes data that still carries geometry. A quick fix is st_drop_geometry: drop the geometry first, then either join it back with a regular join later or reference the variable later in the plot.
Show the code
```r
joined <- st_join(nyc_boundaries, tree_points_samp, join = st_contains)

highlight <- joined |>
  group_by(CounDist) |>
  st_drop_geometry() |>  # removing geometry for quick computations
  summarise(num_trees = n()) |>
  slice_max(num_trees, n = 1)

# Plot
joined |>
  group_by(CounDist) |>
  mutate(num_trees = n()) |>
  slice(1) |>  # keep one polygon per district
  ungroup() |>
  ggplot() +
  geom_sf(aes(fill = num_trees), color = "white") +
  # highlighting district with highest num of trees
  geom_sf(data = filter(joined, CounDist == highlight$CounDist),
          fill = NA, color = "red", size = 1.2) +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  theme_void() +
  labs(
    title = "Number of Trees per District",
    caption = glue("District {highlight$CounDist} (highlighted in red) has the most trees ({highlight$num_trees})"),
    fill = "Number of Trees"
  )
```
Which council district has the highest density of trees?
This plot is similar to the previous one, except it shows tree density instead of raw tree counts. Districts vary in size, and larger districts tend to have more trees simply because of their area; District 51, for example, covers a large area. Having more trees does not automatically mean a district is doing well: a district could be proportionally larger than the others while having the same number of trees per block. Measuring tree density instead tells us there are more trees relative to the district’s size.
Show the code
```r
# Find the densities of each area
tree_density <- joined |>
  st_drop_geometry() |>  # removing geometry for quick computations
  group_by(CounDist) |>
  summarise(
    n = n(),
    area = first(Shape_Area),  # fixing an issue with summarizing
    .groups = "drop"
  ) |>
  mutate(
    trees_per_mi = round(n / (area / 27878400), 1)  # ft² -> mi²
  )

# Join back to geometry; keep all districts
tree_density <- nyc_boundaries |>
  select(CounDist, geometry, Shape_Area) |>
  left_join(tree_density, by = "CounDist")

# Compute the district with the max density
max_density <- tree_density |>
  st_drop_geometry() |>
  filter(trees_per_mi == max(trees_per_mi)) |>
  slice(1)

# Plot
ggplot(tree_density) +
  geom_sf(aes(fill = trees_per_mi), color = "white", linewidth = 0.2) +
  geom_sf(data = filter(joined, CounDist == max_density$CounDist),
          fill = NA, color = "red", size = 1.2) +
  scale_fill_gradient(
    low = "lightgreen",
    high = "darkgreen",
    name = expression("Trees per mi"^2)
  ) +
  labs(
    title = "Tree Density per District",
    caption = paste0("District ", max_density$CounDist,
                     " has the highest tree density, with ",
                     max_density$trees_per_mi, " Trees per mi²")
    # no `fill =` here; legend title already set in the scale
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "right"
  )
```
Show the code
```r
joined |>
  st_drop_geometry() |>
  select(tpcondition) |>
  distinct(tpcondition)
```
  tpcondition
1        Fair
2        Good
3   Excellent
4     Unknown
5        Poor
6        Dead
7    Critical
Show the code
```r
dead_ratio <- joined |>
  st_drop_geometry() |>  # removing geometry for quick computations
  group_by(CounDist) |>
  summarise(
    total = n(),
    n_dead = length(which(tpcondition == "Dead")),
    dead_frac = n_dead / total,
    .groups = "drop"
  )

# Joining back on geometry
dead_ratio <- nyc_boundaries |>
  select(CounDist, geometry, Shape_Area) |>
  left_join(dead_ratio, by = "CounDist")

max_dead <- dead_ratio |>
  st_drop_geometry() |>
  filter(dead_frac == max(dead_frac)) |>
  slice(1)

dead_ratio |>
  ggplot() +
  geom_sf(aes(fill = dead_frac), color = "white", linewidth = 0.2) +
  scale_fill_gradientn(
    colours = c("#FFFFCC", "darkgreen", "#4D004B"),  # light yellow → green → dark purple
    values = scales::rescale(c(0, 0.5, 1)),
    name = "Dead Tree Ratio"
  ) +
  theme_void() +
  labs(
    title = "Fraction of Dead Trees per District",
    caption = paste0("District ", max_dead$CounDist,
                     " has the highest ratio of dead trees (",
                     round(max_dead$dead_frac, 2), ").")
  ) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "right"
  )
```
Show the code
```r
tree_borough <- joined |>
  mutate(
    Borough = case_when(
      CounDist >= 1  & CounDist <= 10 ~ "Manhattan",
      CounDist >= 11 & CounDist <= 18 ~ "Bronx",
      CounDist >= 19 & CounDist <= 32 ~ "Queens",
      CounDist >= 33 & CounDist <= 48 ~ "Brooklyn",
      CounDist >= 49 & CounDist <= 51 ~ "Staten Island"
    )
  )

tree_borough |>
  st_drop_geometry() |>
  filter(Borough == "Manhattan") |>
  group_by(genusspecies) |>
  summarize(number_of_spec = n()) |>
  arrange(number_of_spec) |>
  ungroup() |>
  mutate(
    genusspecies = str_to_title(str_extract(genusspecies, "[^-]+$")),
    Percentage = percent(number_of_spec / sum(number_of_spec), accuracy = .01)
  ) |>
  slice_max(order_by = number_of_spec, n = 5) |>
  mutate(number_of_spec = comma(number_of_spec)) |>
  rename(`Species` = genusspecies, `Amount of Trees` = number_of_spec) |>
  kbl(align = c("l", "c", "c"),  # one alignment per column (Species, Amount, Percentage)
      caption = "Top 5 Common Tree Species in Manhattan") |>
  kable_minimal(c("hover", "condensed", "responsive"),
                full_width = FALSE, font_size = 15) |>
  column_spec(1, color = "darkgreen") |>
  column_spec(2, bold = TRUE) |>
  column_spec(3, color = "gray40")
```
Top 5 Common Tree Species in Manhattan

Species                 Amount of Trees   Percentage
Thornless Honeylocust        175           14.23%
London Planetree             111            9.02%
Callery Pear                  78            6.34%
Pin Oak                       77            6.26%
Maidenhair Tree               70            5.69%
Show the code
```r
# Baruch College point (lon, lat)
baruch_point <- st_sfc(st_point(c(-73.9836, 40.7402)), crs = 4326)

# Transform both to projected CRS for accurate distance in feet
tree_points_proj <- st_transform(tree_points_samp, 2263)  # NYC State Plane (feet)
baruch_proj <- st_transform(baruch_point, 2263)

# Compute distances and get top 5 closest trees
tree_points_proj |>
  mutate(distance = as.numeric(st_distance(geometry, baruch_proj))) |>  # convert to numeric
  arrange(distance) |>
  slice_head(n = 5) |>
  mutate(
    Species = str_to_title(str_extract(genusspecies, "[^-]+$")),
    distance = round(distance, 1)  # clean numeric, no units
  ) |>
  select(Species, Condition = tpcondition, `Distance (in ft)` = distance) |>
  st_drop_geometry() |>
  kbl(
    align = c("l", "c", "c"),
    caption = "Top 5 Closest Trees to Baruch College"
  ) |>
  kable_minimal(c("hover", "condensed", "responsive"),
                full_width = FALSE, font_size = 15) |>
  column_spec(1, color = "darkgreen") |>
  column_spec(3, bold = TRUE)
```
Top 5 Closest Trees to Baruch College

Species            Condition   Distance (in ft)
Chestnut Oak       Good              611.1
Unknown            Dead              613.9
Maidenhair Tree    Fair              709.0
Caucasian Linden   Excellent         904.5
Maidenhair Tree    Fair             1007.4
Final Insights and Deliverable
Code related to the final deliverable of the assignment goes here.